Binning: Converting Numerical Classification into Text Classification

نویسندگان

  • Aynur A. Dayanik
  • Sofus A. Macskassy
  • Haym Hirsh
چکیده

Consider a supervised learning problem in which examples contain both numericaland text-valued features. One common approach to this problem would be to treat the presence or absence of a word as a Boolean feature, which when combined with the other numerical features enables the application of a range of traditional feature-vector-based learning methods. This paper presents an alternative approach, in which numerical features are converted into “bag of word” features, enabling instead the use of a range of existing text-classification methods. Our approach creates a set of bins for each feature into which its observed values can fall. Two tokens are defined for each bin endpoint, representing which side of a bin’s endpoint a feature value lies. A numerical feature is then assigned the bag of tokens appropriate for its value. Not only does this approach now make it possible to apply text-classification methods to problems involving both numerical and text-valued features, even problems that contain solely numerical features can be converted using this representation so that text-classification methods can be applied. We therefore evaluate our approach both on a range of real-world datasets taken from the UCI Repository that solely involve numerical features, as well as on additional datasets that contain both numericaland text-valued features. Our results show that the performance of the text-classification methods using the binning representation often meets or exceeds that of traditional supervised learning methods (C4.5, k-NN, NBC, and Ripper), even on existing numericalfeature-only datasets from the UCI Repository, suggesting that text-classification methods, coupled with binning, can serve as a credible learning approach for traditional supervised learning

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Converting numerical classification into text classification

Consider a supervised learning problem in which examples contain both numericaland textvalued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. However, the use of a text-classification system on this is a bit more problematic—in the...

متن کامل

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

متن کامل

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

متن کامل

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

متن کامل

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000